{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ig3hmpIt8ptJ"
      },
      "source": [
        "# Semantic Search with Cohere Embed Jobs"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fc94ylzpucDm"
      },
      "outputs": [],
      "source": [
        "# TODO: upgrade to \"cohere>5\"",
"! pip install \"cohere<5\" hnswlib -q"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Qu9a_jwZ5Zop"
      },
      "outputs": [],
      "source": [
        "import time\n",
        "import cohere\n",
        "import hnswlib\n",
        "co = cohere.Client('COHERE_API_KEY')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Fgqhh1kk8mnY"
      },
      "source": [
        "## Step 1: Upload a dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kogkZnDEnK8B",
        "outputId": "646b4508-8111-42ad-a9b2-3425da1ae5ff"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "uploading file, starting validation...\n",
            "sample-file-hca4x0 was uploaded\n",
            "...\n"
          ]
        }
      ],
      "source": [
        "# Upload a dataset for embed jobs\n",
        "# This sample dataset has wikipedia articles on the following: Youtube, United States, United Kingdom, Elizabeth II, Wikipedia, 2022 FIFA World Cup, Microsoft Office, India, Christiano Ronaldo, Cleopatra, Instagram, Facebook, and Ukraine\n",
        "\n",
        "dataset_file_path = \"data/embed_jobs_sample_data.jsonl\" # Full path - https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/embed_jobs_sample_data.jsonl\n",
        "\n",
        "ds=co.create_dataset(\n",
        "\tname='sample_file',\n",
        "\tdata=open(dataset_file_path, 'rb'),\n",
        "\tkeep_fields = ['id','wiki_id'],\n",
        "\tdataset_type=\"embed-input\"\n",
        "\t)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3lAS4bcvCplJ",
        "outputId": "5e1fcd0b-4880-43ed-d8a4-03148b2ef047"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "cohere.Dataset {\n",
            "\tid: sample-file-hca4x0\n",
            "\tname: sample_file\n",
            "\tdataset_type: embed-input\n",
            "\tvalidation_status: validated\n",
            "\tcreated_at: 2024-01-13 02:51:48.215973\n",
            "\tupdated_at: 2024-01-13 02:51:48.215973\n",
            "\tdownload_urls: ['https://storage.googleapis.com/cohere-user/dataset-api-temp/d489c39a-e152-49da-9ddc-9801bd74d823/f8bd5ab1-ccdf-45e6-9717-23b8be1ff39d/sample-file-hca4x0/000_embed_jobs_sample_data.avro?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=dataset%40cohere-production.iam.gserviceaccount.com%2F20240113%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240113T025159Z&X-Goog-Expires=14399&X-Goog-Signature=7f9ed7cb27988e4bf447167160b25940cd106859d1282919604a0d74b882681b5aaa3245f97555bb61ddca7fd9c02fb0bba8391433c70aeb2118607dfd534db444097c5ae9c8d07e66bf6723a64634f5d6f5b2500c82e351807203f5fd3c278b2584c258b4c6afc6ee191e63bcd346e3dd2c7fd4b171c2d3ddfe8e6e68c2522111b6f63e125a052be9ebe3302903f4bef9cd165c8e075b168e621c7c61177a0973bb470cc3535457078e18b338884fb303792a1ba3161ab5ca5ce2adca5676754ee1a735891ccbf64cca03d3fdcabdcccd0d201600b8c70e13487bf897f2b88424cc736f8bf8756ec835b09dacf0aa20dc8151d9249b21d4ca4de82144fcd8be&X-Goog-SignedHeaders=host']\n",
            "\tvalidation_error: None\n",
            "\tvalidation_warnings: []\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "print(ds.await_validation())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QgD2QCJk9kUN"
      },
      "source": [
        "## Step 2: Create embeddings via Cohere's Embed Jobs endpoint"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "yMYoU9y55m2D",
        "outputId": "fd58014d-a9e3-4a3f-ca4b-966880dec412"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "...\n",
            "...\n"
          ]
        }
      ],
      "source": [
        "# Dataset has been uploaded, create an embed job and specify the input type as \"search document\" since this will live in your Pinecone DB\n",
        "job = co.create_embed_job(\n",
        "    dataset_id=ds.id,\n",
        "    input_type='search_document' ,\n",
        "    model='embed-english-v3.0', \n",
        "    embeddings_types=['float'])\n",
        "\n",
        "job.wait() # poll the server until the job is completed "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LEevvN2XCUpR",
        "outputId": "d123c0a4-0c6a-4235-9ba8-658d5d3ec358"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "cohere.EmbedJob {\n",
            "\tjob_id: 792bbc1a-561b-48c2-8a97-0c80c1914ea8\n",
            "\tstatus: complete\n",
            "\tcreated_at: 2024-01-13T02:53:31.879719Z\n",
            "\tinput_dataset_id: sample-file-hca4x0\n",
            "\toutput_urls: None\n",
            "\tmodel: embed-english-v3.0\n",
            "\ttruncate: RIGHT\n",
            "\tpercent_complete: 100\n",
            "\toutput: cohere.Dataset {\n",
            "\tid: embeded-sample-file-drtjf9\n",
            "\tname: embeded-sample-file\n",
            "\tdataset_type: embed-result\n",
            "\tvalidation_status: validated\n",
            "\tcreated_at: 2024-01-13 02:53:33.569362\n",
            "\tupdated_at: 2024-01-13 02:53:33.569362\n",
            "\tdownload_urls: ['https://storage.googleapis.com/cohere-user/dataset-api-temp/d489c39a-e152-49da-9ddc-9801bd74d823/f8bd5ab1-ccdf-45e6-9717-23b8be1ff39d/embeded-sample-file-drtjf9/001_embeded-sample-file.avro?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=dataset%40cohere-production.iam.gserviceaccount.com%2F20240113%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240113T025352Z&X-Goog-Expires=14399&X-Goog-Signature=927d2d73947058c95cd73f813e87e81416163dd0ccba1d5ce064a186fdc732550a605ff2122c9087fcc8e16dded9836084be532dac59622f73cb625fa1014fa9b1f9b702ca02be50006d8e2b2239870e3c765f7f8bd4964ccc74f4d9cc28672cd22c0d47008b112bd787eaab19692a5d9d724899f6aac088facdbe6d3b9d74f929fa07e76126656a5b0635a9793d9e85e692ca2d1d40f406869a53aa2e7390570d67e602fc5ea1741686176ce857112d5d68dee1b2fc62a436cd244890ec7c5901941cb10b01c2c41d6eba67dc393370a7c47526a7272f1f70f880b5fea2f69cb4546a633440a538fad7b92fdeb28d98aec1680cafc89142053da915268ea039&X-Goog-SignedHeaders=host']\n",
            "\tvalidation_error: None\n",
            "\tvalidation_warnings: []\n",
            "}\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "print(job)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZVf1njJ4952G"
      },
      "source": [
        "## Step 3: Download and prepare the embeddings"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0FdyxbcR5uci"
      },
      "outputs": [],
      "source": [
        "# Save down the output of the job\n",
        "embeddings_file_path = 'embed_jobs_output.csv'\n",
        "output_dataset=co.get_dataset(job.output.id)\n",
        "output_dataset.save(filepath=embeddings_file_path, format=\"csv\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VvheIBdJ6FC_"
      },
      "outputs": [],
      "source": [
        "# Add the results\n",
        "embeddings=[]\n",
        "texts=[]\n",
        "for record in output_dataset:\n",
        "  embeddings.append(record['embeddings']['float'])\n",
        "  texts.append(record['text'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-O-Yz8MS-DFz"
      },
      "source": [
        "## Step 4: Initialize Hnwslib index and add embeddings"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_vGfX8aQ9uD1"
      },
      "outputs": [],
      "source": [
        "# Create the hnsw index\n",
        "index = hnswlib.Index(space='ip', dim=1024)\n",
        "index.init_index(max_elements=len(embeddings), ef_construction=512, M=64)\n",
        "index.add_items(embeddings,list(range(len(embeddings))))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yN5qBoVX-M7g"
      },
      "source": [
        "## Step 5: Query the index and rerank the results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5NWR4MJUBQHW"
      },
      "outputs": [],
      "source": [
        "# Query the Database\n",
        "query = \"What was the first youtube video about?\"\n",
        "\n",
        "# Convert the query into embeddings\n",
        "query_emb=co.embed(\n",
        "    texts=[query], model=\"embed-english-v3.0\", input_type=\"search_query\"\n",
        "        ).embeddings\n",
        "\n",
        "# Retrieve the initial results from your vector db\n",
        "doc_index = index.knn_query(query_emb, k=10)[0][0]\n",
        "\n",
        "# From the doc_index, get the text from each index and then pass the text into rerank\n",
        "docs_to_rerank = []\n",
        "for index in doc_index:\n",
        "  docs_to_rerank.append(texts[index])\n",
        "\n",
        "final_result = co.rerank(\n",
        "    query= query,\n",
        "    documents=docs_to_rerank,\n",
        "    model=\"rerank-english-v2.0\",\n",
        "    top_n=3)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Oya4CRdE-WDU"
      },
      "source": [
        "## Step 6: Display the results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "rvs0Se2wER1a",
        "outputId": "eeb3e148-af79-46c0-ca3b-ae9facde5fbc"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Document Rank: 1, Document Index: 0\n",
            "Document: YouTube began as a venture capital–funded technology startup. Between November 2005 and April 2006, the company raised money from various investors, with Sequoia Capital, $11.5 million, and Artis Capital Management, $8 million, being the largest two. YouTube's early headquarters were situated above a pizzeria and a Japanese restaurant in San Mateo, California. In February 2005, the company activated codice_1. The first video was uploaded April 23, 2005. Titled \"Me at the zoo\", it shows co-founder Jawed Karim at the San Diego Zoo and can still be viewed on the site. In May, the company launched a public beta and by November, a Nike ad featuring Ronaldinho became the first video to reach one million total views. The site launched officially on December 15, 2005, by which time the site was receiving 8 million views a day. Clips at the time were limited to 100 megabytes, as little as 30 seconds of footage.\n",
            "Relevance Score: 0.94815\n",
            "\n",
            "\n",
            "Document Rank: 2, Document Index: 1\n",
            "Document: Karim said the inspiration for YouTube first came from the Super Bowl XXXVIII halftime show controversy when Janet Jackson's breast was briefly exposed by Justin Timberlake during the halftime show. Karim could not easily find video clips of the incident and the 2004 Indian Ocean Tsunami online, which led to the idea of a video-sharing site. Hurley and Chen said that the original idea for YouTube was a video version of an online dating service, and had been influenced by the website Hot or Not. They created posts on Craigslist asking attractive women to upload videos of themselves to YouTube in exchange for a $100 reward. Difficulty in finding enough dating videos led to a change of plans, with the site's founders deciding to accept uploads of any video.\n",
            "Relevance Score: 0.91626\n",
            "\n",
            "\n",
            "Document Rank: 3, Document Index: 2\n",
            "Document: YouTube was not the first video-sharing site on the Internet; Vimeo was launched in November 2004, though that site remained a side project of its developers from CollegeHumor at the time and did not grow much, either. The week of YouTube's launch, NBC-Universal's \"Saturday Night Live\" ran a skit \"Lazy Sunday\" by The Lonely Island. Besides helping to bolster ratings and long-term viewership for \"Saturday Night Live\", \"Lazy Sunday\"'s status as an early viral video helped establish YouTube as an important website. Unofficial uploads of the skit to YouTube drew in more than five million collective views by February 2006 before they were removed when NBCUniversal requested it two months later based on copyright concerns. Despite eventually being taken down, these duplicate uploads of the skit helped popularize YouTube's reach and led to the upload of more third-party content. The site grew rapidly; in July 2006, the company announced that more than 65,000 new videos were being uploaded every day and that the site was receiving 100 million video views per day.\n",
            "Relevance Score: 0.90665\n",
            "\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# Output Results\n",
        "for idx, r in enumerate(final_result):\n",
        "  print(f\"Document Rank: {idx + 1}, Document Index: {r.index}\")\n",
        "  print(f\"Document: {r.document['text']}\")\n",
        "  print(f\"Relevance Score: {r.relevance_score:.5f}\")\n",
        "  print(\"\\n\")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
